\(\newcommand{\mathds}[1]{\mathrm{I\hspace{-0.7mm}#1}}\) \(\newcommand{\bm}[1]{\boldsymbol{#1}}\) \(\newcommand{\bms}[1]{\boldsymbol{\scriptsize #1}}\) \(\newcommand{\proper}[1]{\text{#1}}\) \(\newcommand{\pE}{\proper{E}}\) \(\newcommand{\pV}{\proper{Var}}\) \(\newcommand{\pCov}{\proper{Cov}}\) \(\newcommand{\pACF}{\proper{ACF}}\) \(\newcommand{\I}{\bm{\mathcal{I}}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\pP}{\proper{P}}\) \(\newcommand{\pAIC}{\textsf{AIC}}\) \(\DeclareMathOperator{\diag}{diag}\)

2  Random Sample and Sampling Distributions

2.1 Random sample

Statistics is the science of collecting, analysing, and interpreting data. The earliest applications of statistics were to demographic and economic measures and were driven by the state (hence the name “statistics”).

We use statistics when we want to draw conclusions about a set of individuals which we are unable to examine in its entirety. We then define the population as the set of individuals that we want to draw conclusions about while the sample is defined as the portion of the population that we actually examine. The number of individuals in the sample corresponds to the sample size. The measured characteristic from each individual in the sample is a random variable and the collection of characteristics from all individuals in the sample is called a random sample. Each element in the random sample is an observation from the same population. The set of all possible values of these random variables is called the sample space.

Often we wish to measure some unknown characteristic of the population. A characteristic of the population is called a parameter. The set of all possible values of the parameters is called the parameter space. We use the sample to infer the value of the parameter. Any quantity calculated from the sample is called a statistic. A statistic is therefore a random variable and its distribution is called the sampling distribution.

Example 2.1 Market research organisations conduct opinion polls regularly. Figure 2.1 shows the result of such a poll. This poll was conducted by the firm YouGov on 15 October 2021. The question asked participants to state whether they felt older, the same, or younger than their real age. We can see that 4621 adults from Great Britain responded to this question, so the sample size is \(n=4621\). This number is significantly lower than the adult population of Great Britain, but it would have been impractical for YouGov to poll every adult.

From the results of the poll we can see that 16% of the respondents feel older than their real age, 32% feel the same, and 47% feel younger; the remaining 5% said they don’t know. Although these results are derived from the sample, if we assume that the sample is properly chosen, then we can expect the corresponding proportions in the whole population to be similar.

Bar chart showing how people feel about their age: 47% younger, 32% same, 16% older, 5% don’t know.
Figure 2.1: An example of a poll Source: YouGov

Example 2.2 The Office for National Statistics (ONS) wishes to measure the unemployment rate in the UK. To that end, it chooses people of working age within the UK and asks them whether they are employed or seeking employment. The proportion among those asked who are seeking employment can be used to estimate the unemployment rate. Figure 2.2 shows a typical warning appearing on ONS’s webpage regarding uncertainty in their estimates of population measures.

In this example the population consists of all individuals able to work in the UK. The parameter we wish to estimate is the unemployment rate \(p\), which is a proportion, so the parameter space is the set \([0,1]\). Because the ONS cannot ask every individual, it asks a subset of the population. The individuals asked constitute the sample. The proportion in the sample seeking employment is a statistic because it is calculated from the sample and not the whole population.

Suppose \(n\) individuals were asked and let \(X_i\) denote the response of the \(i\)th individual, \(i=1,\dots,n\). We let \(X_i=1\) if the \(i\)th individual is seeking employment and \(X_i=0\) if not, so in this case the sample space is the set \(\{0,1\}\). The random sample is the set \(\{X_1,\dots,X_n\}\). The proportion in the sample is also the mean of the \(X_i\)’s, denoted by \(\bar X\). Each \(X_i\) is distributed as \(X_i\sim\text{Bernoulli}(p)\), so \(W=\sum_{i=1}^n X_i\sim\text{Bin}(n,p)\) and the sampling distribution of \(\bar X=W/n\) is that of a \(\text{Bin}(n,p)\) random variable divided by \(n\).
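This sampling distribution can be explored numerically. The sketch below simulates one such survey, assuming a hypothetical rate \(p=0.05\) and sample size \(n=1000\) (both values are illustrative, not ONS figures):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)  # seeded for reproducibility
p, n = 0.05, 1000               # hypothetical unemployment rate and sample size

# One simulated survey: X_i = 1 if individual i is seeking employment
x = rng.binomial(1, p, size=n)
xbar = x.mean()                 # the sample proportion, a statistic

# Since W = sum(X_i) ~ Bin(n, p), probabilities for xbar come from the binomial,
# e.g. the probability that the estimated rate is at most 6%:
prob = st.binom.cdf(0.06 * n, n, p)
print(xbar, prob)
```

Because the statistic \(\bar X\) is just \(W/n\), any probability statement about it reduces to a binomial calculation, as in the last line.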

A yellow banner with a warning icon next to the text ‘Estimates are subject to sampling variability.’
Figure 2.2: ONS uncertainty note (Source: www.ons.gov.uk)
Note: Random sample

Definition 2.1 The random variables \(X_1,\ldots,X_n\) are called a random sample of size \(n\) from the population \(f(x\mid\theta)\) depending on a parameter \(\theta\) if \(X_1,\ldots,X_n\) are mutually independent random variables and the probability density/mass function (pdf/pmf) of each \(X_i\) is the same function \(f(x\mid\theta)\). The variables \(X_1,\ldots,X_n\) are also called independent and identically distributed (iid) random variables. We write \(X_1,\ldots,X_n\;\text{iid}\sim f(x\mid\theta)\).

Often we are interested in the joint distribution of our sample. Let \(X_1,\ldots,X_n\;\text{iid}\sim f(x\mid\theta)\). Then the joint pdf/pmf of \(X_1,\ldots,X_n\) is \[ f(x_1,\ldots,x_n\mid\theta) = f(x_1\mid\theta)\times\cdots\times f(x_n\mid\theta) = \prod_{i=1}^n f(x_i\mid\theta), \] where the first equality is true because the random variables are mutually independent.

Example 2.3 Let \(X_1,\ldots,X_n\;\text{iid}\sim \text{Exponential}(\mu)\), where \(\mu\) denotes the mean of the distribution. For example \(X_1,\ldots,X_n\) may correspond to the failure times (measured in years) for \(n\) identical circuit boards that are put to test and used until they fail and \(\mu\) denotes the average lifetime. Note that with this notation, the rate parameter is \(\lambda=1/\mu\).

Each \(X_i\) has pdf \(f(x\mid\mu) = \frac{1}{\mu}\exp\!\left(-\frac{x}{\mu}\right)\), so the joint pdf of the sample is \[ f(x_1,\ldots,x_n\mid\mu) = \prod_{i=1}^n f(x_i\mid\mu) = \frac{1}{\mu^n}\exp\!\left(-\frac{1}{\mu}\sum_{i=1}^n x_i\right). \]
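We can sanity-check this factorisation numerically. The snippet below compares the product of the individual pdfs with the closed form above, for an arbitrary small sample (the values of \(\mu\) and \(n\) are just illustrative):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
mu = 2.0                               # illustrative mean parameter
x = rng.exponential(scale=mu, size=5)  # a small sample x_1, ..., x_n

# Joint pdf as a product of the marginal pdfs
joint_product = np.prod(st.expon.pdf(x, scale=mu))
# Closed form: mu^{-n} * exp(-sum(x_i)/mu)
joint_closed = mu ** (-len(x)) * np.exp(-x.sum() / mu)

print(joint_product, joint_closed)  # the two agree up to floating-point error
```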

2.2 Statistics and their sampling distributions

In statistical inference, we are interested in describing the distribution of the population. In most cases, a suitable calculation using the sampled values can help.

Note: Statistic and its sampling distribution

Definition 2.2 Let \(X_1,\ldots,X_n\;\text{iid}\sim f(x\mid\theta)\). A function \(T=T(X_1,\ldots,X_n)\) of the variables \(X_1,\ldots,X_n\), which does not depend on \(\theta\), is called a statistic. The statistic is itself a random variable. The probability distribution of \(T\) is called its sampling distribution.

In other words, any quantity that is calculated using the sample is a statistic. Another way to think of the sampling distribution is as the distribution of all possible values of \(T\) for all possible random samples of size \(n\) from the population \(f(x\mid\theta)\).

Example 2.4 Let \(X_1,\ldots,X_n\) be a random sample of size \(n\). Two of the most frequently used statistics are the sample mean, \(\bar X\), and the sample variance \(S^{2}\) defined by \[ \bar X = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^{2} = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2. \]
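In Python these two statistics can be computed directly; note the `ddof=1` argument, which matches the \(n-1\) divisor in the definition of \(S^2\). The data below are arbitrary:

```python
import numpy as np

x = np.array([2.1, 3.4, 1.9, 4.2, 2.8])  # an arbitrary sample

xbar = x.mean()      # sample mean
s2 = x.var(ddof=1)   # sample variance with the n-1 divisor

# Equivalent to writing the definition out explicitly
s2_manual = ((x - xbar) ** 2).sum() / (len(x) - 1)
print(xbar, s2)
```

The default `np.var` divides by \(n\) rather than \(n-1\), so forgetting `ddof=1` is a common source of small discrepancies.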

Suppose \(X_1,\ldots,X_n\;\text{iid}\sim \mathcal N(\mu,\sigma^2)\). Then, the sampling distribution of \(\bar X\) is \(\bar X \sim \mathcal N(\mu,\sigma^2/n)\), i.e., the normal distribution with mean \(\mu\) and variance \(\sigma^2/n\), and the sampling distribution of \(S^{2}\) is given by \(\frac{(n-1)S^{2}}{\sigma^2}\sim \chi^2_{n-1}\), i.e., \(S^2\) is distributed as \(\sigma^2/(n-1)\) times a chi-squared random variable with \(n-1\) degrees of freedom. Moreover, \(\bar X\) and \(S^{2}\) are independent in the case of normal populations.
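Both sampling distributions can be checked by simulation. The sketch below draws many normal samples (the values of \(\mu\), \(\sigma\) and \(n\) are illustrative) and compares the simulated moments with the theoretical ones:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 3.0, 2.0, 10, 5000   # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)                     # sample mean for each replicate
s2 = x.var(axis=1, ddof=1)                # sample variance for each replicate

# xbar should be approximately N(mu, sigma^2 / n)
print(xbar.mean(), xbar.var())            # close to mu and sigma^2/n
# (n-1) S^2 / sigma^2 should be approximately chi^2_{n-1}, whose mean is n-1
print(((n - 1) * s2 / sigma ** 2).mean())  # close to n-1
```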

The sampling distribution is not always easy to derive, either because the distribution of the population is unknown or because the statistic does not have a straightforward expression. Sometimes we can state asymptotic results as the sample size increases.

Important: Law of large numbers

Theorem 2.1 Let \(X_1,\ldots,X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2<\infty\). Then, the sample mean \(\bar X\) converges in probability to the population mean \(\mu\) as the sample size \(n\) grows.

Formally, for any small error \(\varepsilon>0\), \[ \Pr\left(\,|\bar X-\mu|\ge \varepsilon\,\right) \to 0 \quad \text{as } n\to\infty. \]

Tip: Not examinable

Proof. This is easily proved by Chebyshev’s inequality: for any random variable \(Y\) with finite variance, \(\Pr\{|Y-\mathbb E[Y]|\ge r\} \le \frac{\operatorname{Var}(Y)}{r^2}\) for all \(r>0\). Applying this with \(Y=\bar X\), which has mean \(\mu\) and variance \(\sigma^2/n\), gives \[ \Pr\{\,|\bar X-\mu|\ge \varepsilon\,\} \le \frac{\operatorname{Var}(\bar X)}{\varepsilon^2} = \frac{\sigma^2/n}{\varepsilon^2} = \frac{\sigma^2}{n\,\varepsilon^2} \to 0 \quad \text{as } n\to\infty. \]
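The Chebyshev bound in the proof can be compared with the actual probability by simulation; the normal population and the specific \(n\) and \(\varepsilon\) below are just one illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, eps, reps = 0.0, 1.0, 100, 0.2, 20000

# Empirical P(|xbar - mu| >= eps) over many simulated samples
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(xbar - mu) >= eps)

bound = sigma ** 2 / (n * eps ** 2)  # Chebyshev bound: 0.25 here
print(empirical, bound)              # the bound holds but is conservative
```

As the output suggests, Chebyshev’s inequality is valid for any distribution with finite variance, but for well-behaved populations the true probability is far smaller than the bound.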

The law of large numbers states that, for any fixed margin of error, the probability that the sample mean deviates from the population mean by more than that margin can be made arbitrarily small by choosing a large enough sample size.

Example 2.5 Suppose \(X_1,\ldots,X_n\;\text{iid}\sim \text{Exponential}(\mu)\). Then \(\mathbb E[X_i]=\mu\), therefore \(\mathbb E[\bar X]=\mu\). The law of large numbers says that the probability that \(|\bar X-\mu|\) exceeds a small number \(\varepsilon\) can become arbitrarily small by increasing the sample size \(n\). This is illustrated by the following Python code.

Code
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

N = 10000  # Max sample size
mu = 1     # The mean (scale) parameter
x = st.expon.rvs(size=N, scale=mu)
xbar = (x.cumsum()) / (np.arange(1, N+1))

plt.plot(xbar, label='Running mean $\\bar X_n$')  # xbar at n = 1,2,...,N
plt.axhline(mu, ls='--', color='k', label='True mean $\\mu$')
plt.xlabel('n')
plt.ylabel('Mean')
plt.legend()
plt.title('Law of Large Numbers for Exponential($\\mu$)')
plt.show()

Example 2.6 The game of roulette — and why the house always wins. In the game of roulette, a wheel consisting of 37 pockets, numbered 0 to 36, is spun and a ball is dropped onto it (see Figure 2.3). The ball will eventually come to rest in one of the numbered pockets. Players can bet money on the outcome of the spin and win money if they guess correctly.

Suppose a player bets £1 on a specific number \(x\). This player will win £35 if the ball lands in \(x\), otherwise, they lose their bet of £1. In other words, their profit is \(+35\) if they win the bet and \(-1\) if they lose the bet. Let \(X\) denote the outcome of the wheel spin, and let \(W\) denote the player’s winnings after one bet. The expected value of \(W\) is \[ \mathbb E[W] = 35\,\Pr\{X=x\} - 1\cdot\Pr\{X\ne x\} = 35\cdot\frac{1}{37} - 1\cdot\frac{36}{37} = \frac{35-36}{37} = -\frac{1}{37} \approx -0.027. \]

We observe that the expected winnings, from a player’s point of view, are negative. This does not mean that a player loses money at every bet, and in fact, it is possible that any one player will win big. However, in a typical day, there are thousands of bets taking place. By the law of large numbers, the average winnings across these bets will converge to the mean of \(-1/37\approx-0.027\), so collectively players lose about 2.7 pence per £1 bet on average.
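The following sketch simulates a large number of £1 bets on a single number (17 is arbitrary; any fixed number behaves the same) and checks that the average winnings approach \(-1/37\):

```python
import numpy as np

rng = np.random.default_rng(4)
n_bets = 1_000_000

spins = rng.integers(0, 37, size=n_bets)  # pockets 0..36, uniformly at random
winnings = np.where(spins == 17, 35, -1)  # +35 on a win, -1 otherwise

avg = winnings.mean()
print(avg)  # close to -1/37, about -0.027 pounds per bet
```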

Top-down illustration of a European roulette wheel with pockets 0–36.
Figure 2.3: Roulette wheel

The sample mean is ubiquitous in statistics and it is important to know its sampling distribution. The next theorem summarises the large-sample behaviour of the sample mean.

Important: Central limit theorem

Theorem 2.2 Let \(X_1,\ldots,X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2<\infty\). Then, the sampling distribution of the sample mean \(\bar X\) can be approximated by the normal distribution with mean \(\mu\) and variance \(\sigma^2/n\), i.e., \(\mathcal N(\mu,\sigma^2/n)\), for large sample size \(n\).

Formally, let \(Z_n = \sqrt{n} (\bar X-\mu)/\sigma\). Then, for any \(z\in\mathbb R\), \[ \Pr\{Z_n<z\} \to \Phi(z) \quad \text{as } n\to\infty, \] where \(\Phi\) denotes the CDF of the \(\mathcal N(0,1)\) distribution.

In other words, the central limit theorem says that the CDF of \(\bar X\) and the CDF of \(\mathcal N(\mu,\sigma^2/n)\) are visually indistinguishable for large sample size. Since in many cases we cannot come up with the sampling distribution of the sample mean, the approximate normal distribution can be used assuming that the sample size is large.

Tip: Not examinable

Proof. We will prove this theorem by showing that the moment generating function (mgf) of \(Z_n\), \(M_n(t)\), converges, as \(n\to\infty\), to the moment generating function of \(\mathcal N(0,1)\). Since the mgf determines the distribution of the random variable uniquely, it follows that the limiting distribution of \(Z_n\) is \(\mathcal N(0,1)\). Without loss of generality, we can assume \(\mu=0\). In this case \(Z_n = \sqrt{n}\,\bar X/\sigma = \sum X_i/(\sqrt{n}\,\sigma)\). If \(\mu\ne 0\), we can apply the theorem to the random variables \(Y_i = X_i-\mu\) and then substitute \(\bar Y\) with \(\bar X-\mu\).

Let \(M_X(t)\) denote the mgf of \(X_i\), i.e., \(M_X(t)=\mathbb E[e^{tX_i}]\). By the properties of the mgf, the mgf of \(Z_n\) is \[ M_n(t) = \mathbb E\big[e^{t Z_n}\big] = \prod_{i=1}^n M_X\!\left(\tfrac{t}{\sqrt{n}\,\sigma}\right) = \left\{ M_X\!\left(\tfrac{t}{\sqrt{n}\,\sigma}\right) \right\}^{n}. \] The mgf of \(\mathcal N(0,1)\) is \(M(t)=\exp(t^2/2)\). We will show that \(\lim_{n\to\infty} \log M_n(t) = \log M(t)\), i.e., \[ \lim_{n\to\infty} n\, \log M_X\!\left(\tfrac{t}{\sqrt{n}\,\sigma}\right) = \tfrac{t^2}{2}. \] Let \(u=1/\sqrt{n}\) and consider the limit \(\lim_{u\to 0} \dfrac{\log M_X\!\left(\tfrac{tu}{\sigma}\right)}{u^2}\). Using L’Hôpital’s rule twice, together with the facts \(M_X(0)=1\), \(M_X'(0)=\mathbb E[X_i]=0\), and \(M_X''(0)=\mathbb E[X_i^2]=\sigma^2\), we obtain \[ \lim_{u\to 0}\frac{\log M_X(tu/\sigma)}{u^2} = \lim_{u\to 0}\frac{(t/\sigma)\,M_X'(tu/\sigma)}{2u\,M_X(tu/\sigma)} = \lim_{u\to 0}\frac{(t/\sigma)\,M_X'(tu/\sigma)}{2u} = \lim_{u\to 0}\frac{(t/\sigma)^2\,M_X''(tu/\sigma)}{2} = \frac{t^2}{2}, \] where the middle simplification uses \(M_X(tu/\sigma)\to 1\) as \(u\to 0\).

Example 2.7 Suppose \(X_1,\ldots,X_n\;\text{iid}\sim \text{Exponential}(\mu)\). Then \(\mathbb E[X_i]=\mu\) and \(\operatorname{Var}(X_i)=\mu^2\). By the central limit theorem, the distribution of \(\bar X\) is approximately \(\mathcal N(\mu,\mu^2/n)\) for large \(n\). This is illustrated by the following Python code.

Code
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

N = 10000  # Number of repetitions
n = 50     # Sample size for each repetition
mu = 1.0   # The mean (scale) parameter

x = st.expon.rvs(size=(N, n), scale=mu)
xbar = x.mean(axis=1)  # Sample mean across rows

xx = np.linspace(xbar.min(), xbar.max(), 200)
plt.hist(xbar, density=True, bins=50, alpha=0.5, facecolor='gray', label='Simulated $\\bar X$')
plt.plot(xx, st.norm.pdf(xx, mu, mu/np.sqrt(n)), 'r-', lw=2, label='Normal approx')
plt.legend()
plt.title('Sampling distribution of $\\bar X$ vs Normal approximation')
plt.show()

2.3 Exercises

  1. A coffee shop buys roasted coffee from a supplier. In order to assess the quality of the supplied coffee, the manager of the shop conducts a tasting experiment where she selects a small portion of coffee beans from different batches and tastes the coffee from each portion. For each portion she gives a score in the scale \(1,2,\ldots,10\) with 10 corresponding to coffee of the best taste and uses the results to assess the quality of the coffee. Identify the population, parameter, and statistic.

  2. Read the abstract of the article: Dietary Intake of Marine n-3 Fatty Acids, Fish Intake, and the Risk of Coronary Disease among Men by Ascherio and others published in The New England Journal of Medicine in 1995. Identify the population, parameter, sample, and statistic.

  3. Let \(X_1,\ldots,X_n\;\text{iid}\sim \mathcal N(\mu,\sigma^2)\). Derive the sampling distribution of \(\bar X\) given in Example 2.4.

  4. Let \(X_1,\ldots,X_n\;\text{iid}\sim \text{Bernoulli}(p)\).

    1. Derive the sampling distribution of \(\bar X\), i.e., for \(x\in\{0/n,1/n,2/n,\ldots,n/n\}\) find the probability \(\Pr\{\bar X=x\}\).
      Hint. Let \(W=\sum_{i=1}^n X_i\) so that \(\bar X=W/n\). First find the distribution of \(W\) and use that to find \(\Pr\{\bar X=x\}\).

    2. Derive the asymptotic distribution of \(\bar X\) from the central limit theorem.

    3. Draw a graph of the exact and approximate CDFs when \(n=20\) and \(p=0.4\).